Add MIG validations including RHEL setup#21
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds gpu_validation_mode and related defaults; installs/configures NVIDIA CUDA repo and container tooling on RHEL for MIG; adds MIG detection and assertions; prepares RHEL kernel tooling for MIG; and wires mode-specific flavor extra_specs and an optional vllm service flag. ChangesGPU Validation Mode Configuration and MIG Support
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
gpu-validation/tasks/gpus.yaml (1)
12-15: ⚡ Quick winMake A30 GPU detection more specific.
The string search for
"A30"could match unintended substrings in lspci output (e.g., "GA30", "A300", or other device descriptions). Consider combining the NVIDIA and A30 checks or using a more specific pattern.♻️ Recommended fix to improve detection accuracy
- name: Set found_A30 to true if A30 is found ansible.builtin.set_fact: found_a30: true - when: lspci_output.stdout is search("A30") + when: lspci_output.stdout is search("NVIDIA.*A30")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gpu-validation/tasks/gpus.yaml` around lines 12 - 15, The current GPU detection sets found_a30 based on a loose substring match of lspci_output.stdout for "A30"; tighten it by changing the conditional on the ansible.builtin.set_fact that sets found_a30 to use a more specific regex that ensures both the vendor and model are present (e.g., require "NVIDIA" near "A30") or use word boundaries and case-insensitivity to avoid partial matches; update the when clause referencing lspci_output.stdout and the found_a30 task to use that stricter search (e.g., a regex like a vendor+model match or \bA30\b with (?i) for case-insensitive matching).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gpu-validation/defaults/main.yaml`:
- Line 28: The pci passthrough alias is hardcoded to "gpu-l4:{{
gpu_validation_flavor_gpus }}" which assumes a preconfigured Nova PCI alias;
make the alias configurable and provide guidance/tests: introduce a new variable
(e.g., gpu_validation_pci_alias) and replace the hardcoded "gpu-l4" in the
"pci_passthrough:alias" value with that variable (falling back to a sensible
default or empty to disable the extra-spec), update references to the template
key "pci_passthrough:alias" and the variable "gpu_validation_flavor_gpus", and
add a short note or unit test that validates the variable is set or documents
required Nova pci alias mappings.
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 18-21: The task "Reboot the VM to find the installed drivers" in
gpu-validation/tasks/nvidia-cuda-repos.yaml is issuing an unconditional
ansible.builtin.reboot after installing nvidia-container-toolkit; remove or make
that reboot conditional so it doesn't force downtime. Replace the unconditional
reboot with either removal of the reboot task or wrap it with the same check
used by reboot_if_needed.yaml (the needs-restarting -r check / driver-reboot
condition), so the reboot runs only when needs-restarting indicates a
kernel/driver restart is required.
- Around line 40-41: The task that runs "nvidia-ctk runtime configure
--runtime=containerd" can fail on hosts without containerd or
/etc/containerd/config.toml; update gpu-validation/tasks/nvidia-cuda-repos.yaml
to either (a) ensure containerd is installed and configured before this task by
adding the containerd install/configure steps, or (b) add a guard that checks
for containerd presence (e.g., ansible.builtin.stat on /usr/bin/containerd or
/etc/containerd/config.toml and/or ansible.builtin.service status for
containerd) and only runs the nvidia-ctk command when that check succeeds
(otherwise skip or fail with a clear message).
In `@requirements.yaml`:
- Line 8: The git dependency
"git+https://github.com/openstack-k8s-operators/ci-framework.git" must be pinned
to a specific tag or commit to ensure reproducible installs; update that entry
by appending a tag or commit reference (e.g., use
"git+https://github.com/openstack-k8s-operators/ci-framework.git@v1.0.0" or a
commit SHA) so the requirements.yaml points to an immutable release instead of
the floating HEAD.
---
Nitpick comments:
In `@gpu-validation/tasks/gpus.yaml`:
- Around line 12-15: The current GPU detection sets found_a30 based on a loose
substring match of lspci_output.stdout for "A30"; tighten it by changing the
conditional on the ansible.builtin.set_fact that sets found_a30 to use a more
specific regex that ensures both the vendor and model are present (e.g., require
"NVIDIA" near "A30") or use word boundaries and case-insensitivity to avoid
partial matches; update the when clause referencing lspci_output.stdout and the
found_a30 task to use that stricter search (e.g., a regex like a vendor+model
match or \bA30\b with (?i) for case-insensitive matching).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: cafbb32f-07d3-429e-bb08-8523477d29a3
📒 Files selected for processing (11)
gpu-validation/defaults/main.yamlgpu-validation/tasks/gpus.yamlgpu-validation/tasks/main.yamlgpu-validation/tasks/nvidia-cuda-repos.yamlgpu-validation/tasks/nvidia.yamlgpu-validation/tasks/nvidia_assertions.yamlgpu-validation/tasks/setup.yamlgpu-validation/tasks/vm_image.yamlgpu-validation/templates/vllm-serve.service.j2main.yamlrequirements.yaml
| - name: Configure NVIDIA container runtime | ||
| ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for containerd installation tasks
echo "=== Checking for containerd installation ==="
rg -n "name:.*containerd" --type yaml -C 2
# Check if containerd is a dependency or installed via dnf/yum
rg -n "containerd" --type yaml -C 3
# Check nvidia-ctk documentation or usage for runtime requirements
rg -n "nvidia-ctk.*runtime.*configure" --type yaml -C 3Repository: rhos-vaf/gpu-validation
Length of output: 1212
Add a containerd dependency/presence check before configuring NVIDIA container runtime
gpu-validation/tasks/nvidia-cuda-repos.yaml runs nvidia-ctk runtime configure --runtime=containerd, but the repository’s YAML does not show any containerd install/presence step—only this containerd-related usage was found—so the command can fail on hosts without containerd and/or /etc/containerd/config.toml.
Ensure containerd is installed (and the expected config exists) before this task, or add a guard (e.g., stat/service check) to fail clearly or skip.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml` around lines 40 - 41, The task
that runs "nvidia-ctk runtime configure --runtime=containerd" can fail on hosts
without containerd or /etc/containerd/config.toml; update
gpu-validation/tasks/nvidia-cuda-repos.yaml to either (a) ensure containerd is
installed and configured before this task by adding the containerd
install/configure steps, or (b) add a guard that checks for containerd presence
(e.g., ansible.builtin.stat on /usr/bin/containerd or
/etc/containerd/config.toml and/or ansible.builtin.service status for
containerd) and only runs the nvidia-ctk command when that check succeeds
(otherwise skip or fail with a clear message).
| - name: community.general | ||
| version: ">=10.0.0" | ||
| - name: git+https://github.com/openstack-k8s-operators/edpm-ansible.git | ||
| - name: git+https://github.com/openstack-k8s-operators/ci-framework.git |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Pin ci-framework collection to a specific version.
The git dependency lacks version pinning (tag, branch, or commit SHA), which can cause non-reproducible builds and unexpected breakage if the upstream repository changes.
📌 Recommended fix to pin the dependency
- - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+ - name: git+https://github.com/openstack-k8s-operators/ci-framework.git,mainOr better yet, pin to a specific tag or commit:
- name: git+https://github.com/openstack-k8s-operators/ci-framework.git,v1.0.0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: git+https://github.com/openstack-k8s-operators/ci-framework.git | |
| - name: git+https://github.com/openstack-k8s-operators/ci-framework.git,main |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@requirements.yaml` at line 8, The git dependency
"git+https://github.com/openstack-k8s-operators/ci-framework.git" must be pinned
to a specific tag or commit to ensure reproducible installs; update that entry
by appending a tag or commit reference (e.g., use
"git+https://github.com/openstack-k8s-operators/ci-framework.git@v1.0.0" or a
commit SHA) so the requirements.yaml points to an immutable release instead of
the floating HEAD.
There was a problem hiding this comment.
edpm-ansible is not pinned - following that convention.
There was a problem hiding this comment.
@rlandy, fair point — edpm-ansible being unpinned establishes the convention for this repo, and ci-framework is consistent with that. I'll withdraw the suggestion.
(∪ ・ ω ・)∪
| - name: community.general | ||
| version: ">=10.0.0" | ||
| - name: git+https://github.com/openstack-k8s-operators/edpm-ansible.git | ||
| - name: git+https://github.com/openstack-k8s-operators/ci-framework.git |
There was a problem hiding this comment.
why do we want ci-framework as a dependency?
There was a problem hiding this comment.
I think it is - as that is the role we are using to install RHEL repos
| hosts: target_nodes | ||
| gather_subset: | ||
| - min | ||
| gather_facts: true |
There was a problem hiding this comment.
nit: was it missing some facts outside the min scope?
There was a problem hiding this comment.
reverting back to min
| found_nvidia: true | ||
| when: lspci_output.stdout is search("NVIDIA") | ||
|
|
||
| - name: Set found_A30 to true if A30 is found |
There was a problem hiding this comment.
please drop as no longer used
| path: /etc/cdi/nvidia.yaml | ||
| register: nvidia_driver_cdi_config_file | ||
|
|
||
| - name: Configure NVIDIA container runtime |
There was a problem hiding this comment.
These (Check if CDI configfile exists and Configure NVIDIA container runtime) can be part of a configuration task after Install NVIDIA CUDA repos and Reboot if system updates require it tasks
There was a problem hiding this comment.
Starting by leaving the reboot where it is - let's see if the tasks can get by with it there.
There was a problem hiding this comment.
reboot was required - conditional reboot task did not work here - added reboot back.
| gpgcheck: 1 | ||
| gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-{{ ansible_facts['distribution_major_version'] }}" | ||
|
|
||
| - name: Install DKMS (Not supported) |
There was a problem hiding this comment.
Nit, let's leave the label (UNSUPPORTED) as standard
rlandy
left a comment
There was a problem hiding this comment.
Responding to comments
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
gpu-validation/tasks/setup.yaml (2)
28-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin kernel-devel and kernel-headers to the running kernel version.
Installing
kernel-develandkernel-headerswithout version constraints will fetch the latest available versions, which may not match the currently running kernel (ansible_kernel). DKMS module compilation requires an exact version match between the running kernel and the installed headers. A mismatch will cause the NVIDIA driver installation (viaosp.edpm.edpm_accel_drivers) to fail, breaking the entire MIG validation flow.This scenario is realistic: a system installed with an older kernel that hasn't rebooted after newer kernel packages became available in the repositories.
🔧 Proposed fix to pin kernel packages to running kernel version
- name: Install DKMS (UNSUPPORTED) ansible.builtin.dnf: use_backend: dnf4 name: - dkms - - kernel-devel - - kernel-headers + - "kernel-devel-{{ ansible_kernel }}" + - "kernel-headers-{{ ansible_kernel }}" state: present🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gpu-validation/tasks/setup.yaml` around lines 28 - 35, The task "Install DKMS (UNSUPPORTED)" installs kernel-devel and kernel-headers unpinned, which can pull headers for a different kernel than the running one; change the package names to pin to the current running kernel using the ansible fact (ansible_kernel) so you install kernel-devel-{{ ansible_kernel }} and kernel-headers-{{ ansible_kernel }} (leave dkms unpinned), ensuring the playbook gathers facts before this task and that the package list in the ansible.builtin.dnf task uses those suffixed names so DKMS can compile against the exact running kernel.
14-17:⚠️ Potential issue | 🟠 MajorFix kernel-mismatch risk and validate the external role task file
gpu-validation/tasks/setup.yamlinstallskernel-devel/kernel-headerswithout pinning to the running kernel; DKMS builds typically require an exact kernel-devel match (add version pinning toansible_kerneland/or ensure a reboot-before-DKMS flow).gpu-validation/tasks/setup.yamlincludescifmw.general.repo_setupwithtasks_from: rhos_release; local code doesn’t contain this role/tasks, and inci-frameworktherhos_release.ymlusage appears associated with thereproducerrole—confirm thatcifmw.general.repo_setupactually contains arhos_release(orrhos_release.yml) task file so the include doesn’t fail at runtime.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gpu-validation/tasks/setup.yaml` around lines 14 - 17, In gpu-validation/tasks/setup.yaml the play installs kernel-devel/kernel-headers without pinning to the running kernel (risking DKMS build failures) and includes_role name cifmw.general.repo_setup with tasks_from: rhos_release which may not exist; update the package installation step to install kernel-devel-{{ ansible_kernel }} and kernel-headers-{{ ansible_kernel }} (or otherwise ensure a reboot-before-DKMS flow) and verify/adjust the include_role usage: confirm that the cifmw.general.repo_setup role actually contains a rhos_release (or rhos_release.yml) task file and change tasks_from to the correct task name or remove the tasks_from so the role loads its default tasks to avoid runtime failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@gpu-validation/tasks/setup.yaml`:
- Around line 28-35: The task "Install DKMS (UNSUPPORTED)" installs kernel-devel
and kernel-headers unpinned, which can pull headers for a different kernel than
the running one; change the package names to pin to the current running kernel
using the ansible fact (ansible_kernel) so you install kernel-devel-{{
ansible_kernel }} and kernel-headers-{{ ansible_kernel }} (leave dkms unpinned),
ensuring the playbook gathers facts before this task and that the package list
in the ansible.builtin.dnf task uses those suffixed names so DKMS can compile
against the exact running kernel.
- Around line 14-17: In gpu-validation/tasks/setup.yaml the play installs
kernel-devel/kernel-headers without pinning to the running kernel (risking DKMS
build failures) and includes_role name cifmw.general.repo_setup with tasks_from:
rhos_release which may not exist; update the package installation step to
install kernel-devel-{{ ansible_kernel }} and kernel-headers-{{ ansible_kernel
}} (or otherwise ensure a reboot-before-DKMS flow) and verify/adjust the
include_role usage: confirm that the cifmw.general.repo_setup role actually
contains a rhos_release (or rhos_release.yml) task file and change tasks_from to
the correct task name or remove the tasks_from so the role loads its default
tasks to avoid runtime failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: e2df970f-d421-407d-ba63-fd58726c3169
📒 Files selected for processing (3)
gpu-validation/defaults/main.yamlgpu-validation/tasks/nvidia-cuda-repos.yamlgpu-validation/tasks/setup.yaml
💤 Files with no reviewable changes (1)
- gpu-validation/tasks/nvidia-cuda-repos.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- gpu-validation/defaults/main.yaml
| block: | ||
| - name: Run repo-setup for RHEL VMs | ||
| ansible.builtin.include_role: | ||
| name: cifmw.general.repo_setup |
There was a problem hiding this comment.
please do not make this depending on downstream cifmw.general.repo_setup 's rhos_release tool. Instead, provide a boostrap command variable which CI jobs can overload with rhos_release and repo enablement commands
There was a problem hiding this comment.
Can you explain more here? What would this look like?
e05f415 to
952d2fe
Compare
This PR adds validations for MIG configurations: - Add flavor configuration for MIG testing - Update MIG checks and NVIDIA validations to work on all GPU types - Add RHEL guest repository support - Add DKMS installation and reboot handling - Install NVIDIA repos with custom RPM - Make CUDA repo available for installation - Install nvidia-container-toolkit and configure CDI - Setup CDI and NVIDIA Management Library - Add GPU utilization option in vLLM service - Move CUDA install tasks to separate file - Remove A30-specific conditionals to support all GPU types with updated RHEL 9.6 drivers - Fix reboot sequencing for proper driver and CDI initialization Co-Authored-By: Claude Sonnet 4.5 Signed-off-by: Ronelle Landy <rlandy@redhat.com>
952d2fe to
4c52584
Compare
Add MIG validations and remove GPU-specific conditionals
This PR adds validations for MIG configurations:
Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Ronelle Landy rlandy@redhat.com